The use of remote sensing data to predict water quality variables that are optically active (i.e., interacts with light) has been applied to ocean and coastal waters for ~50 years. Thanks to the new generation of sensors with adequate spectral, radiometric and spatial resolution (i.e., Landsat, Sentinel-2, etc) in the last 15 years the community started to use RS to freshwater studies.
Remote sensing allow us to predict some WQ parameters: Suspended sediments, chlorophyl-a, phycocianin, dissolved organic matter, carbon, secchi disk depth, turbidity…It is an important source of data that could help biologists, limnologists, and all the aquatic science community into understanding of water pattern.
In this Notebook, we gonna learn how to use Remote Sensing data to predict water quality variables (Chl-a and Suspended Sediments). For that, we gonna use a hyperspectral dataset for global freshwater (GLORIA) to simulated Landsat and Sentinel-2 bands (i.e., makes the in situ data the “same” as satellite see). After that, we gonna do some exploratory analysis and filters to remove outliers and develop a empirical and Machine Learning algorithm (Random Forest) to predict Chl-a and TSS for satellite images.
The processing flow is divided into three topics:
require(data.table)
require(dplyr)
require(rstac)
require(terra)
require(mapview)
require(httr)
require(Metrics)
require(geodrawr)
require(svDialogs)
require(rstac)
require(wesanderson)
require(PerformanceAnalytics)
The GLORIA dataset is a compilation of remote sensing and water quality data for global waters, with dedicated data for freshwater ecosystems. It is free and available for everyone, and covers most part of the globe with more than 7,000 samples (Figure 01)
Figure 01
For more information, users are refered to the publication (Lehmann et al. 2023), the dataset in PANGAEA and the Nature Earth and Environmment blog post
In the following codes, we are gonna:
# Let's download the GLORIA to a empty folder called "Data" in our project
## First, create directories
dir.create("Data")
dir.create("Outputs")
dir.create("Scripts")
## URL
URL = 'https://download.pangaea.de/dataset/948492/files/GLORIA-2022.zip'
# Before download, let`s set timeout to 200s (sometimes PANGAEA download is slow)
options(timeout=300)
# If the directory with files doesn't exist, let's download GLORIA data.
if(dir.exists('Data/GLORIA_2022/') == FALSE) {
# Download
download.file(URL, 'Data/GLORIA.zip')
# Extract
unzip(zipfile = 'Data/GLORIA.zip', exdir = 'Data')
}
After downloading the data, we can explore it. It is composed of various sheets, such as:
We reccomend the user to open the table in Excel or Google Docs and explore it. In this exercise, we gonna use the following tables:
Let’s remember that Remote Sensing Reflectance is the ratio between water leaving radiance and downwelling irradiance, compensated by the sky radiance and corrected by glint effects (Equation 01).
\[\begin{equation} R_{rs} = (L_t - (L_{sky}*\rho))/E_s \end{equation}\]
These tables contain all the necessary variables we need to produce the desired algorithms. Let`s now open and explore it. You gonna see that the collumn GLORIA_ID contains the ID of each unique station. This collumn will be used in the next steps to merge both tables.
## See the dataset
# META and Lab data
meta_and_lab = fread("Data/GLORIA_2022/GLORIA_meta_and_lab.csv")
head(meta_and_lab)
# Rrs data
rrs = fread("Data/GLORIA_2022/GLORIA_Rrs.csv")
head(rrs)
However, this is a hiperspectral dataset. If we want to develop RS models, we will need to simulate (i.e. integrate based on the Relative Spectral Response) each band of the desiered sensor. When we simulate a satellite band, we are compensating for the differences in detector sensibility to each wavelength. The Figure below shows differences in the spectral response function for Sentinel-2A/MSI, Landsat-8/OLI and Landsat-7/ETM+. You can note that relative spectral response values close to “1” indicates that the detector can measure (or detect) all the radiance in this wavelength. A sensor band is composed by an interval of these wavelengths and then, the simulated band is the integral the R[rs] considering the Relative Spectral Response curve, or Equation 02.
Figure 02
\[\begin{equation} R_{rs} = \frac{\int_{n}^{m} SRF(\lambda) R_{rs} dx}{\int_{n}^{m} SRF(\lambda)} \end{equation}\]
To do that, we gonna use the bandSimulation package available in GitHub.
devtools::install_github("dmaciel123/BandSimulation")
## Skipping install of 'bandSimulation' from a github remote, the SHA1 (88ed39b8) has not changed since last install.
## Use `force = TRUE` to force installation
require(bandSimulation)
## Loading required package: bandSimulation
## Simulation for Sentinel-2 MSI sensor
spectra_formated = select(rrs, paste("Rrs_", 400:900, sep = '')) %>% t()
head(spectra_formated[1:10,1:10])
## [,1] [,2] [,3] [,4] [,5]
## Rrs_400 0.001032099 0.0009919781 0.001176673 0.001045943 0.001123216
## Rrs_401 0.001026443 0.0009886549 0.001172247 0.001044805 0.001121460
## Rrs_402 0.001025443 0.0009907777 0.001173282 0.001049251 0.001125342
## Rrs_403 0.001042929 0.0010105485 0.001195316 0.001072244 0.001149169
## Rrs_404 0.001032327 0.0009995005 0.001184033 0.001062849 0.001139232
## Rrs_405 0.001033530 0.0010029657 0.001187121 0.001068868 0.001144698
## [,6] [,7] [,8] [,9] [,10]
## Rrs_400 0.001262030 0.001502713 0.001159079 0.001598394 0.002235938
## Rrs_401 0.001260533 0.001500229 0.001153895 0.001589690 0.002231068
## Rrs_402 0.001264954 0.001506565 0.001155018 0.001587384 0.002237669
## Rrs_403 0.001291754 0.001539501 0.001176772 0.001613440 0.002284503
## Rrs_404 0.001281843 0.001521784 0.001162394 0.001597356 0.002260299
## Rrs_405 0.001288210 0.001528472 0.001164688 0.001597279 0.002268107
MSI_sim = msi_simulation(spectra = spectra_formated,
point_name = rrs$GLORIA_ID)
## Loading required package: openxlsx
## Loading required package: tidyr
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:terra':
##
## extract
#It simulates for Sentinel-2A and Sentinel-2B and gives the results in a list.
# Let's select only Sentinel-2A.
MSI = MSI_sim$s2a[,-1] %>% t() %>% data.frame()
head(MSI[,1:9])
# Add names to a collumn
MSI$GLORIA_ID = row.names(MSI)
#Check final
#head(MSI[,1:9])
# Change band names
names(MSI) = c('x440', "x490", 'x560', 'x660', "x705", 'x740', 'x780', 'x850', 'x865', "GLORIA_ID")
ROW = 150
# Example
matplot(t(select(rrs, paste("Rrs_", 400:900, sep = ''))[ROW,]), ylim = c(0,0.015),
x= c(400:900), pch = 20, xlab = '', ylab = '')
par(new=T)
matplot(t(MSI[ROW,-10]), x= c(440,490,560,660,705,740,780,850,860), pch = 20,
ylim = c(0,0.015), xlim = c(400,900), col = 'red', cex = 2, xlab = 'Wavelength (nm)',
ylab = 'Rrs (sr-1)')
## Merge with water quality, lat long (By GLORIA_ID)
merged = merge(select(meta_and_lab, c('GLORIA_ID', 'Chla', "TSS", "Latitude", "Longitude")),
MSI, by = "GLORIA_ID")
head(merged)
vector = vect(merged,
geom = c('Longitude', 'Latitude'),
"EPSG:4326")
mapview(sf::st_as_sf(vector), zcol = 'TSS')
## Warning in validateCoords(lng, lat, funcName): Data contains 127 rows with
## either missing or invalid lat/lon values and will be ignored
Before continuing and saving the simulated data, we will need to remove No-data values, perform some filter process (e.g., remove very low values of Chl-a) and also calculate some indexes that could help into modeling process.
Let’s do that.
# We want to model TSS and Chla concentration based on RF
merged = merged[is.na(merged$Chla) == FALSE, ]
merged = merged[is.na(merged$TSS) == FALSE, ]
#head(merged)
## Index calculation:
merged$NDCI = (merged$x705-merged$x660)/(merged$x705+merged$x660)+1
merged$Blue_red = (merged$x490-merged$x660)/(merged$x490+merged$x660)+1
merged$green_blue = (merged$x560-merged$x490)/(merged$x560+merged$x490)+1
merged$nir_red = (merged$x740-merged$x660)/(merged$x740+merged$x660)+1
# Check dimension
dim(merged)
## [1] 3203 18
Performing some exploratory analyzes.
chart.Correlation(merged[,-c(1, 4,5)])